Software  Engineering  Challenges 
for  Parallel  Processing  Systems 


Lt  Col  Marcus  W  Hervey,  USAF 

AFIT/CIP 

marcus.hervey@us.af.mil 


Disclaimer 


The  views  expressed  in  this  presentation  are 
Research  Directions 


Summary 


From  Moore's  to  Cores 


Previously, sequential programs ran faster simply by moving to
higher-frequency processors, with no changes to the code

Chip manufacturers ran into problems continuing down
this path

- Heat generation

- Power consumption

The performance metric shifted from processor clock speed to the number of
processors/cores

Today, optimum performance requires significant code
changes and the knowledge to develop correct and efficient
parallel programs


What's  All  the  Fuss  About? 


Matrix  Multiply  using  OpenMP 


Jacobi  using  OpenMP 


Parallel  Processing: 

• Solves problems faster or solves larger problems

• More complex - must match the best algorithm with the best
programming model and the best architecture


Applications  of  Parallel  Computing 


Embedded  Systems 

- Cell phones, automobiles, PDAs

Gaming  Systems 

—  Playstation  3,  Xbox  360 

Desktop/Laptops 

—  Dual-core/Quad-core 

Supercomputing  (HPC/HPTC/HEC) 

—  www.top500.org 

Parallel  Processing  is  mainstream! 


Military  Applications  of 
Parallel  Computing 


Training,  Systems 

Simulation 


The  New  Frontier 


Standard  Architectures 

—  Beowulf  Clusters  /  Grid  Computing 

—  Dual-core/Quad-core  -  Intel/AMD 

—  Intel's  80-core  machine 

Non-standard  Architectures 

—  72-core  machine  -  Sicortex 

—  FPGAs  -  Field-programmable  gate  array 

—  GPGPUs  -  Nvidia,  AMD  (ATI) 

—  Cell  Processor  -  IBM  -  Playstation  3 

—  Accelerators  -  Clearspeed 


Parallel  Processing  Architectures 


Distributed  Memory 


[Figure: distributed memory architecture diagram (processors and memory)]

Shared Memory

[Figure: shared memory architecture diagram (processors)]

...there is also Distributed Shared Memory


Message  Passing  Model 

Communicates  by  sending/receiving  messages 


Process  1  Process  2 


OpenMPI 

MPICH 


Designed  for  Distributed  Memory  Machines 


OpenMPI  Code  Example 


#include <stdio.h>
#include <string.h>   /* strcpy */
#include <mpi.h>

int main(int argc, char **argv) {
    char buff[20];
    int myrank;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0) {
        /* Rank 0 sends the greeting to rank 1 with message tag 99 */
        strcpy(buff, "Hello World!\n");
        MPI_Send(buff, 20, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
    }
    else if (myrank == 1) {
        /* Rank 1 receives the greeting from rank 0; any other ranks do nothing */
        MPI_Recv(buff, 20, MPI_CHAR, 0, 99, MPI_COMM_WORLD, &status);
        printf("received :%s:\n", buff);
    }
    MPI_Finalize();
    return 0;
}


Shared-Memory  Model 

Communicates  by  accessing  shared  memory 


Processor  Memory  Processor 


Write  data  Read  data 

•  OpenMP  programming  model 

• POSIX Threads (Pthreads)

•  Unified  Parallel  C 


OpenMP  Fork-join  Pattern 


[Figure: fork-join diagram with parallel regions]


OpenMP  Code  Example 


Without OpenMP

#include <stdio.h>

int main(void)
{
    printf("Hello World!\n");
    return 0;
}

With OpenMP

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int threadid = 0;

    /* Each thread in the team gets its own private copy of threadid */
    #pragma omp parallel private(threadid)
    {
        threadid = omp_get_thread_num();
        printf("%d : Hello World!\n", threadid);
    }
    return 0;
}

Implemented  as  C/C++/Fortran  language  extensions 

Composed  of  compiler  directives,  user  level  runtime  routines,  environment  variables 
Facilitates  incremental  parallelism 
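
Incremental parallelism can be illustrated with the matrix multiply workload named earlier in the presentation. The following is a minimal sketch, not code from the presentation: the function name `matmul` and the row-major layout are assumptions made for illustration. The serial loop nest is left intact and a single directive is added to it.

```c
/* Hypothetical helper: computes C = A * B for n x n matrices stored
   in row-major order. */
void matmul(int n, const double *a, const double *b, double *c)
{
    /* Rows of C are independent of each other, so the outer loop can be
       divided among threads. Variables declared inside the loop body are
       private to each thread. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

When the compiler is not asked for OpenMP support (for GCC, the `-fopenmp` option), the pragma is ignored and the same source runs sequentially, which is what allows a program to be parallelized one loop at a time.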


Pthreads  Code  Example 

#include <stdio.h>
#include <stdlib.h>    /* exit */
#include <pthread.h>
#define NUM_THREADS 5

void *HelloWorld(void *threadid) {
    /* The thread number is passed by value inside the pointer argument */
    printf("%ld : Hello World!\n", (long)threadid);
    pthread_exit(NULL);
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    int rc;
    long t;

    for (t = 0; t < NUM_THREADS; t++) {
        printf("%ld : Hello World!\n", t);
        rc = pthread_create(&threads[t], NULL, HelloWorld, (void *)t);
        if (rc) {
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }
    /* Exit only the main thread so the worker threads can finish */
    pthread_exit(NULL);
}


UPC  Code  Example 


Without UPC

#include <stdio.h>

int main(void)
{
    printf("Hello World!\n");
    return 0;
}

With UPC

#include <stdio.h>
#include <upc.h>

int main(int argc, char *argv[])
{
    int i;

    /* Every thread runs the loop; each prints only on its own iteration */
    for (i = 0; i < THREADS; i++)
    {
        if (i == MYTHREAD)
        {
            printf("%d : Hello World!\n", MYTHREAD);
        }
    }
    return 0;
}


Major  Parallel  Programming  Challenges 


•  Parallel  Thinking/Design 

-  Identifying  the  parallelism 

-  Parallel  algorithm  development 

•  Correctness 

-  Characterizing  parallel  programming  bugs 

-  Finding  and  removing  parallel  software  defects 

•  Optimizing  Performance 

-  Maximizing  speedup  and  efficiency 

•  Managing  software  team  dynamics 

-  Complex  problems  require  large,  dispersed,  multi-disciplinary 
teams 


A  Different  Species  of  Bugs 

Data  Races 

-  When  an  interleaving  of  threads  results  in  an 
undesired  computation  result 

Deadlock 

-  When  two  or  more  threads  stop  and  wait  for  each  other 

Priority  Inversion 

- A higher priority thread is forced to wait on a lower priority thread that holds a resource it needs

Livelock 

-  When  two  or  more  threads  continue  to  execute,  but  make  no 
progress  toward  the  ultimate  goal 

Starvation 

-  When  some  thread  gets  deferred  forever 


Data  Race  Example 


Without Synchronization

Thread A            Thread B
read count = 2
                    read count = 2
count + 2 = 4
                    count + 2 = 4
write count = 4
                    write count = 4

Data Race: both threads read 2, so one update is lost and count ends at 4

With Synchronization

Thread A            Thread B
read count = 2
count + 2 = 4
write count = 4
                    read count = 4
                    count + 2 = 6
                    write count = 6

This type of error in the Therac-25 radiation therapy machine resulted in 5 deaths
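
The synchronized case can be sketched in C with OpenMP. This is an illustrative sketch rather than code from the presentation; the function name `add_in_parallel` and its parameters are hypothetical. Each loop iteration plays the role of one thread's read-add-write on `count`, and the `critical` directive ensures that only one thread at a time performs that sequence.

```c
/* Hypothetical helper: starting from `start`, performs `workers` updates
   of the form count = count + delta, potentially on different threads. */
int add_in_parallel(int start, int workers, int delta)
{
    int count = start;   /* declared outside the loop, so shared by all threads */

    #pragma omp parallel for
    for (int i = 0; i < workers; i++) {
        /* Only one thread at a time may execute the read-add-write. */
        #pragma omp critical
        {
            count = count + delta;
        }
    }
    return count;
}
```

With the values from the example above, `add_in_parallel(2, 2, 2)` returns 6 however the two updates are scheduled. Removing the `critical` directive would allow the lost update shown in the unsynchronized case.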


Deadlock 


MPI 

Example 


PROCESS  1 

Send  (Processor  2) 
Receive(Processor  2) 


PROCESS  2 

Send(Processor  1) 
Receive(Processor  1) 


Waiting  on  Process  2 
to  receive  message 


Waiting  on  Process  1 
to  receive  message 


OpenMP 

Example 


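void worker(void) {
    /* Reached only by the thread that runs the section below;
       this is non-conforming and can deadlock */
    #pragma omp barrier
}

int main(void) {
    #pragma omp parallel sections
    {
        #pragma omp section
        worker();
    }
    return 0;
}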


Synchronization  Errors 


Not  Enough 


Data  Races 


Too  Much 


Deadlock 


•  Missing  or  inappropriately  applying  synchronization 
can  cause  data  races 

•  Applying  too  much  synchronization  can  cause 
deadlock 
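
A common source of deadlock is two threads acquiring the same pair of locks in opposite orders. One widely used remedy, which is not described in the presentation and is shown here only as a hedged sketch, is to acquire locks in a fixed global order. The types `account_t` and `job_t` and the functions `transfer` and `run_opposing_transfers` are hypothetical names chosen for illustration.

```c
#include <pthread.h>

typedef struct {
    int id;                  /* fixed rank that decides lock order */
    long balance;
    pthread_mutex_t lock;
} account_t;

/* Moves `amount` between two accounts. Both locks are always taken in
   ascending id order, whichever direction the transfer goes. */
void transfer(account_t *from, account_t *to, long amount)
{
    account_t *first  = (from->id < to->id) ? from : to;
    account_t *second = (from->id < to->id) ? to : from;

    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);
    from->balance -= amount;
    to->balance   += amount;
    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);
}

typedef struct {
    account_t *from;
    account_t *to;
    int rounds;
} job_t;

static void *run_job(void *arg)
{
    job_t *job = arg;
    for (int i = 0; i < job->rounds; i++) {
        transfer(job->from, job->to, 1);
    }
    return NULL;
}

/* Runs two threads that transfer in opposite directions. Returns the
   final balance of account a, or -1 if a thread could not be created. */
long run_opposing_transfers(int rounds)
{
    account_t a = { .id = 1, .balance = 100 };
    account_t b = { .id = 2, .balance = 100 };
    job_t a_to_b = { &a, &b, rounds };
    job_t b_to_a = { &b, &a, rounds };
    pthread_t t1, t2;
    long result = -1;

    pthread_mutex_init(&a.lock, NULL);
    pthread_mutex_init(&b.lock, NULL);

    if (pthread_create(&t1, NULL, run_job, &a_to_b) == 0) {
        if (pthread_create(&t2, NULL, run_job, &b_to_a) == 0) {
            pthread_join(t2, NULL);
            result = 0;
        }
        pthread_join(t1, NULL);
        if (result == 0) {
            result = a.balance;
        }
    }

    pthread_mutex_destroy(&a.lock);
    pthread_mutex_destroy(&b.lock);
    return result;
}
```

If `transfer` instead locked `from` first and `to` second, the two threads could each hold one lock while waiting for the other. With ordered locking, both threads complete and the balances return to their starting values.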


Priority  Inversion 

• A lower priority thread effectively blocks a higher priority thread

- Low-priority thread enters critical section.

- High-priority thread wants to enter critical section, but
can't enter until low-priority thread is complete.

- Medium-priority thread preempts the low-priority thread, so the high-priority thread waits even longer

• This type of error caused repeated system resets on Mars Pathfinder


Parallel  Performance 


Execution  time 

—  Time  when  the  last  processor  finishes  its  work 

—  Amdahl's  Law  -  Sequential  portions  of  code  limit  speedup 

•  Most  parallel  codes  have  sequential  portion(s) 

Speedup 

—  (1  CPU  execution  time)/  (P  CPUs  execution  time) 

—  Must  compare  to  the  best  sequential  algorithm 

Efficiency 

—  Speedup/P 

—  100%  efficiency  is  hardly  ever  possible 
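
The metrics above can be written out as small C functions. This is a sketch for illustration only; the function names are hypothetical, and `amdahl_speedup` uses the usual statement of Amdahl's Law, in which a fraction of the run time is parallelizable and the remainder stays sequential.

```c
/* Amdahl's Law: predicted speedup on p processors when a fraction
   `parallel_fraction` (0.0 to 1.0) of the run time can be parallelized. */
double amdahl_speedup(double parallel_fraction, int p)
{
    double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / p);
}

/* Speedup = (1 CPU execution time) / (P CPUs execution time). */
double speedup(double t1, double tp)
{
    return t1 / tp;
}

/* Efficiency = Speedup / P. */
double efficiency(double t1, double tp, int p)
{
    return speedup(t1, tp) / p;
}
```

For example, a code that is 90% parallel is predicted to run about 5.26 times faster on 10 processors, and no number of processors can push it past 10 times, because the sequential 10% remains.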


Parallel  Performance  Metrics 


Performance 
of  Jacobi 
using  OpenMP 


Execution  Time 
For  optimum  performance,  parallel  developers  need  to  have 
an  understanding  of  the  application  and  the  architecture 


Parallel  Software  Quality  Goals 


Correctness,  Robustness  and  Reliability 
Performance 

- Speedup, Efficiency, Scalability, Load Balance

Predictability  -  Cost,  Schedule,  Performance 

—  Managing  complexity  of  harder  problems  with 
more  non-standard  architectures  and  more 
diverse  teams 

Maintainable 


Lack  of  Parallel  Software 


Council on Competitiveness Study Reveals Lack of Software

http://www.hpcwire.com/hpc/612904.html

The Council on Competitiveness, a national organization of business, academic and labor executives, has released the
second part of a study that reveals that the lack of scalable application software is preventing many companies from
using high performance computing (HPC) more aggressively for competitive advantage. Part B of the Council on
Competitiveness Study of ISVs Serving the High Performance Computing Market concludes that major U.S. industries
often cannot get the application software they need to drive innovation and global competitiveness. Both parts of the
pioneering study were sponsored by the Defense Advanced Research Projects Agency (DARPA) and conducted by
leading market research firm IDC.

Part A of this study revealed that the independent software vendor (ISV) business model for developing advanced
application software for HPC has nearly evaporated, and that ISVs must focus most of their software development on the
broader commercial market.

"This study demonstrates that the lack of production quality HPC application software is a soft spot in the
competitiveness armor of the U.S.," said Council on Competitiveness President Deborah L. Wince-Smith. "When U.S.
industries cannot obtain the application software they want and need, innovation is stymied and competitiveness is
compromised. Fortunately, we are finding that most ISVs and a substantial portion of U.S. businesses are willing to
partner with each other, as well as universities and national laboratories to speed progress in addressing this challenge."

"Part B: End User Perspectives" directly surveyed a select group of highly experienced HPC users in U.S. businesses,
representing a wide range of industries, from defense to entertainment to consumer products. The study revealed the
U.S. business requirements for advanced HPC application software, and the financial and technical obstacles blocking
firms from obtaining it. The perspectives given by these experienced users echoed many of the findings from the
Council's recently released software workshop report "Accelerating Innovation for Competitive Advantage: The Need
for HPC Application Software Solutions."

A comparison of the key findings from Parts A and B is found in the following chart. The findings reveal the need for
more aggressive use of HPC in American business and the current plans ISVs have to meet these needs. The limitations
of HPC-specific ISV application software are not the only barrier to fuller exploitation of HPC but are regularly cited by
industrial end users as the most important constraint.

Study Part A: Current ISV Market Dynamics

• The business model for HPC-specific application software has all but evaporated in the last decade.

• ISV applications can exploit only a fraction of the problem-solving power of today's high-performance computers.

• For many applications, the ISVs know how to improve scalability but have no plans to do so because the HPC
market is too small to justify the R&D investment.

• There is a lack of readiness among ISV suppliers for petascale systems.

• Market forces alone will not address the gap between HPC users' needs and ISV software capabilities.

• Most ISVs would be willing to partner with outside parties to accelerate application software development.

Study Part B: HPC End Users' Perspectives

• HPC-specific ISV application software is indispensable for U.S. industrial competitiveness.

• Virtually all of the firms said they have larger problems that they can't solve today.

• The lack of scalable application software is preventing many industrial users from using HPC more aggressively
for competitive advantage.
The  Need  for  Software  Engineering 


Outcomes  of  over  9000  sequential 
development  projects  completed  in  2004 

[Pie chart with legend: Unsuccessful, Successful, Canceled]

Completed late, over budget, and/or with features missing: 53%

Source:  [Hayes,  Frank,  “Chaos  is  Back,”  Computerworld,  November  8,  2004.] 


Software  engineering  is  needed  to  create  an 
environment  for  the  development  of  quality  parallel  software 
(reliable,  predictable  and  maintainable) 


Parallel  Software  Engineering 

Process
- Defined, Repeatable

Technology
- Eclipse Parallel Tools Platform, Thread Analyzer, Thread Checker, DDT, TotalView

People
- Technical and Process Training, Discipline

Result: Predictable Cost, Schedule and Performance


Software  Life  Cycles 


Sequential 

Development  Methodology 


Requirements  Analysis 
Design 

Implementation 

Testing 

Deployment 


Parallel 

Development  Methodology 


Requirements  Analysis 

Design 

Sequential 

Code  Profiling 
Parallel  Design 

Parallel  Implementation 
Testing 

Code  Optimization 
(Tuning) 


Deployment 


Patterns  for  Parallel  Programs 


Decomposing  the  problem  to 
exploit  concurrency 


Structuring  the  algorithm  by  tasks, 
data  decomposition  or  by  flow  of 
data 

Defining  the  shared  data  structures 
that  support  algorithm 
implementation 

Implementing  management, 
communication  and 
synchronization 


Source: [T. G. Mattson, B. A. Sanders and B. L. Massingill. Patterns for
Parallel Programming, 2004.]


Technology 


Parallel  Languages 

- OpenMPI, OpenMP, UPC, POSIX Threads, X10, Fortress, Chapel

Compilers 

-  Intel,  Sun,  Open64 

IDEs 

-  Eclipse  Parallel  Tools  Platform 

Debugging  Tools 

-  TotalView,  DDT,  Thread  Checker,  Thread  Analyzer 

Performance  Tools 

-  PAPI,  TAU 


People 


•  Understand  standard/non-standard  architectures 

•  Learn  parallel  programming/bug  patterns 

•  Comprehend  parallel  language  strengths/weaknesses 

•  Learn  the  process  and  tools 

•  Work  within  multi-disciplinary  teams 


Research  Directions 


Exploiting  Nonstandard  Architectures 

—  Cell  Processors,  GPGPUs,  FPGAs,  accelerators 

Parallel  Programming  Models 

—  Extending  existing  languages  C,  C++,  and  Fortran 

—  New  languages  development:  X10,  Chapel,  Fortress 

—  Hybrid  code  development  (OpenMP/MPI) 

Parallel  Compilers 

—  Code  optimization  and  auto-parallelization 

Productivity  Enhancing  Tools 

—  IDEs,  profiling,  optimization  and  debugging  tools 


Resources 


B. Chapman, G. Jost, and R. Van Der Pas. Using OpenMP:
Portable Shared Memory Parallel Programming. The MIT Press, 2008.

T. G. Mattson, B. A. Sanders, and B. L. Massingill. Patterns for
Parallel Programming. Addison-Wesley Professional, 2004.

cOMPunity,  www.compunity.org 

DoD  HPCMO,  www.hpcmo.hpc.mil 

HPC  Bug  Base,  www.hpcbugbase.org 

HPC  Tools  Group,  http://www2.cs.uh.edu/~hpctools/ 

OpenMP,  www.openmp.org 

OpenMPI,  www.open-mpi.org 


Summary 

Parallel  computing  is  all  around  you! 

Parallel  programming  introduces  more  complex  software  defects 
that  are  hard  to  detect  and  debug 

Parallel  software  performance  requires  attention  to  issues  of 
communications,  synchronization,  scalability  and  load  balance 

Better  processes,  tools  and  training  are  needed  to  improve  the 
practice  and  predictability  of  parallel  software  engineering 

Software  developers  and  acquisition  personnel  should  be  aware 
of  the  opportunities  and  challenges  of  parallel  software 
Acronym  List 

• C4ISR - Command, Control, Communications, Computers,
Intelligence, Surveillance, and Reconnaissance

•  DDT  -  Distributed  Debugging  Tool 

•  FPGA  -  Field  Programmable  Gate  Array 

•  GPGPU  -  General  Purpose  Graphics  Processing  Unit 

•  HPC  -  High-Performance  Computing 

•  IDE  -  Integrated  Development  Environment 

•  MPICH  -  Message  Passing  Interface  Chameleon 

• OpenMP - Open Multi-Processing

•  OpenMPI  -  Open  Message  Passing  Interface 

•  PAPI  -  Performance  Application  Programming  Interface 

•  TAU  -  Tuning  and  Analysis  Utilities 

•  UPC  -  Unified  Parallel  C 


